Checking installation and loading packages

Before we begin any script, we first need to make sure that the required packages are installed in the copy of R that RStudio is using. We can then load those packages so they can be used in the script. The code block below does both for you.

# Check if packages are installed, if not install.
if(!require(here)) install.packages('here') #checks if a package is installed and installs it if required.
if(!require(tidyverse)) install.packages('tidyverse')
if(!require(ggplot2)) install.packages('ggplot2')

library(here) #loads in the specified package
library(tidyverse)
library(ggplot2)
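
As a rough sketch, the same check can be written once for a whole list of packages. This version only reports which packages are already installed. The names needed and is_installed are made up for illustration, and the install line is left commented out so that nothing is installed unless you choose to.

```r
# The packages this tutorial uses
needed <- c("here", "tidyverse", "ggplot2")

# requireNamespace() returns TRUE if a package is installed, FALSE if not
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
print(is_installed)

# Uncomment the next line to install anything reported as FALSE
# install.packages(needed[!is_installed])
```

Each package then still needs to be loaded with library() before you use it, exactly as in the code block above.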

What do packages do?

You should be able to see that we have installed and loaded 3 different packages. Let’s first go over the basics of what a package is. In its simplest terms, a package is a toolbox that someone has created for us in R that makes our life easier. These packages build on the basic code that comes with the R programming language (what RStudio uses to run), called base R.

Figure 1: Opening an R package

What do these packages do?

It is always a good idea to check the documentation for a package before you use it. We can do this by using the help syntax, which is the ?. The package we want help with first is called here. Run the code below by clicking the green arrow in the corner of the code block, in the editor on the left side of your screen. This will open a help page that tells us the purpose of the here package and how it works.

Figure 2: Running code in R

?here #? loads the documentation for a specified package.

Activity 1 - Using help syntax

Fill in the code block below by typing the help syntax ? followed by the name of each of the other packages we loaded above. This will open their documentation. Have a read of each of these pages and click on any links you find interesting. These are the main packages we will be using throughout this course.

# Try to use the help function '?' to read more about the packages we are using today
# The packages we are using are 'tidyverse' and 'ggplot2'.

?tidyverse
?ggplot2

Importing your data

The dataset we are using has already been downloaded into the folder containing this R Markdown file. On your computer, navigate to this folder and have a look at what it contains.

You should note that it contains the following:

These are the key ingredients needed to organise all projects in R.

Figure 3: Project Organisation

You will notice that the data for today, called PSYC2001_social-media-data.csv, is a csv file (short for Comma Separated Values). This means that we will need to import the dataset using a function capable of importing csv files.

We will be using two different functions to achieve this. The read.csv() function is used to import our csv dataset and it comes from the utils package which is part of base R. But the read.csv() function needs to know where the file is coming from. To do this, we use the here() function from the here package. This function tells R the location of the project we are working from, to make locating the data easier.

Let’s first confirm that here() knows where our project is located on this PC (the project folder, which acts as our ‘Working Directory’).

here()
## [1] "G:/Current/Student folders/Bart Cool/Work/PSYC2001 in R/Tutorial 2 - Data wrangling and visualization"

We can use this to easily find where our file is located and read it.

social_media <- read.csv(file = here("Data","PSYC2001_social-media-data.csv")) #reads in csv files
Warning: If you have an error, something has gone wrong—please ask your tutor for help!
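
A common cause of an error here is that the file is not where R is looking for it. Below is a minimal sketch of checking for a file before reading it. It uses a tiny made-up csv written to a temporary folder so that it runs anywhere. The names csv_path and example_data are hypothetical and are not part of the course data.

```r
# Write a tiny made-up csv to a temporary folder so the example runs anywhere
csv_path <- file.path(tempdir(), "example-data.csv")
writeLines(c("id,age", "S1,15.2", "S2,16.0"), csv_path)

# Only try to read the file if R can actually find it
if (file.exists(csv_path)) {
  example_data <- read.csv(file = csv_path)
  print(example_data)
} else {
  stop("File not found: ", csv_path)
}
```

With the course data you would check here("Data","PSYC2001_social-media-data.csv") in place of csv_path.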

Having a look at our imported data

Our data should now be imported into R!

The first thing we should do whenever we import data is to see how it looks in RStudio. There are a couple of ways to do this.

Figure 4: Navigating to dataset

# Method 1 - Type in the name of the object
social_media
# Method 2 - Use the View function
View(social_media) #View opens the dataset in a new tab.
# Method 3 - Use the head function
head(social_media) #head displays the first 6 rows of the dataset.
##   id  age time_on_social urban good_mood_likes bad_mood_likes followers
## 1 S1 15.2           3.06     1            22.8           46.5     173.3
## 2 S2 16.0           2.18     1            46.0           48.3     144.3
## 3 S3 16.8           1.92     1            50.8           46.1      76.5
## 4 S4 15.6           2.61     1            29.9           29.2     171.7
## 5 S5 17.1           3.24     1            37.1           52.4     109.5
## 6 S6 15.7           2.44     1            26.9           20.2     157.5
##   polit_informed polit_campaign polit_activism
## 1            2.3            3.2            3.6
## 2            1.6            2.2            2.6
## 3            1.9            2.7            3.0
## 4            1.6            2.3            2.6
## 5            2.0            2.9            3.3
## 6            2.4            3.4            3.9
# Method 4 - Use the str function
str(social_media) #displays an overall summary of the object and variable structure.
## 'data.frame':    60 obs. of  10 variables:
##  $ id             : chr  "S1" "S2" "S3" "S4" ...
##  $ age            : num  15.2 16 16.8 15.6 17.1 15.7 19.7 18.6 19.6 15.5 ...
##  $ time_on_social : num  3.06 2.18 1.92 2.61 3.24 2.44 1.46 1.52 1.92 2.1 ...
##  $ urban          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ good_mood_likes: num  22.8 46 50.8 29.9 37.1 26.9 14.8 26 6.5 45.7 ...
##  $ bad_mood_likes : num  46.5 48.3 46.1 29.2 52.4 20.2 35.1 35.8 12.2 32.8 ...
##  $ followers      : num  173.3 144.3 76.5 171.7 109.5 ...
##  $ polit_informed : num  2.3 1.6 1.9 1.6 2 2.4 1.7 1.6 1.5 2.2 ...
##  $ polit_campaign : num  3.2 2.2 2.7 2.3 2.9 3.4 2.4 2.2 2.1 3.1 ...
##  $ polit_activism : num  3.6 2.6 3 2.6 3.3 3.9 2.7 2.6 2.4 3.5 ...

You should now have a good idea of what PSYC2001_social-media-data.csv looks like in RStudio.
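
A few other base R functions give quick checks on the size and layout of a dataset. The sketch below uses a small made-up data frame called toy_data (a hypothetical name) rather than the course data, but the same calls should work on social_media.

```r
# A small made-up data frame standing in for a real dataset
toy_data <- data.frame(id = c("S1", "S2", "S3"),
                       age = c(15.2, 16.0, 16.8))

dim(toy_data)          # number of rows, then number of columns
names(toy_data)        # the variable (column) names
tail(toy_data, n = 2)  # the last 2 rows, the mirror image of head()
```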

You will also notice that the last function, str(), displays a summary of the object. This includes:

Question: Please discuss with your deskmate and tutor what you think chr and num mean.
Figure 5: You thinking

Checking the quality of our data

Once we have imported our dataset into R, it’s important to check the quality and structure of the data to ensure everything looks as expected. One simple way to do this is by using the summary() function.

summary(social_media) #summary provides a quick overview of the data in each variable. 
##       id                 age        time_on_social         urban    
##  Length:60          Min.   :13.90   Min.   :-999.000   Min.   :1.0  
##  Class :character   1st Qu.:15.70   1st Qu.:   1.920   1st Qu.:1.0  
##  Mode  :character   Median :16.50   Median :   2.365   Median :1.5  
##                     Mean   :16.87   Mean   : -30.845   Mean   :1.5  
##                     3rd Qu.:17.43   3rd Qu.:   3.042   3rd Qu.:2.0  
##                     Max.   :23.00   Max.   :   4.320   Max.   :2.0  
##  good_mood_likes bad_mood_likes    followers      polit_informed 
##  Min.   : 6.50   Min.   :12.20   Min.   : 61.40   Min.   :0.600  
##  1st Qu.:31.60   1st Qu.:39.08   1st Qu.: 76.47   1st Qu.:1.500  
##  Median :45.90   Median :49.30   Median :116.30   Median :1.800  
##  Mean   :43.04   Mean   :49.84   Mean   :124.76   Mean   :1.858  
##  3rd Qu.:53.40   3rd Qu.:58.75   3rd Qu.:153.75   3rd Qu.:2.200  
##  Max.   :89.20   Max.   :91.20   Max.   :336.50   Max.   :3.400  
##  polit_campaign  polit_activism 
##  Min.   :0.800   Min.   :0.900  
##  1st Qu.:2.100   1st Qu.:2.400  
##  Median :2.550   Median :2.900  
##  Mean   :2.602   Mean   :2.977  
##  3rd Qu.:3.100   3rd Qu.:3.500  
##  Max.   :4.800   Max.   :5.500
Question: Do you notice anything unusual in this output? Discuss with your neighbour and tutor.
Hint: Take a closer look at the time_on_social variable.

Cleaning the data

It should now be clear that this data is unusual because the time_on_social variable, which is measured in hours, has a minimum value of -999 (we can’t have negative time!).

Figure 6: Back to the future!

A good question to ask now is: why are these values in the dataset?

Sometimes when collecting data, we can’t get a response from every participant. Instead of leaving a blank, researchers will sometimes put in a placeholder value like -999 to show that the data is missing. These aren’t real numbers; they just mean the data wasn’t recorded. But -999 isn’t the standard way to show missing data in R. R uses NA to represent missing values, and that’s important because most R functions know how to handle NA properly—but they don’t know to ignore -999.
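
To see why this matters, here is a small sketch using made-up numbers (the vector hours is hypothetical, not taken from the dataset). A single -999 drags the mean far away from the real values, whereas NA can be skipped on request with na.rm = TRUE.

```r
# Made-up hours, with one -999 placeholder for a missing response
hours <- c(2, 3, -999, 1)
mean(hours)                      # -248.25, R treats -999 as a real number

# Swap the placeholder for NA
hours_clean <- replace(hours, hours == -999, NA)
mean(hours_clean)                # NA, R warns us that something is missing
mean(hours_clean, na.rm = TRUE)  # 2, the mean of the three real values
```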

Let’s first have a look at how many -999 values are present in the data. We can do this by using the filter() function from the dplyr package (part of the tidyverse), which keeps only the rows that meet a certain condition. We can then use the count() function, also from dplyr, to count the number of rows left in the data frame.

social_media_filtered <- filter(social_media, time_on_social == -999) #keep all rows where `time_on_social` is equal to -999
count(social_media_filtered) #count the rows remaining in the data frame and print the total to the console.
##   n
## 1 2

Introducing Piping

A short aside to introduce a very special operator called a ‘pipe’, written %>%. It allows you to pass the result of one function on to the next, like an assembly line. Throughout the rest of the course we will be using ‘piping’ because it makes code easier to follow and to write. For instance, let’s repeat what we just did above, but with pipes instead.

social_media %>% #pass the values from social_media to the filter function
  filter(time_on_social == -999) %>% #keep all rows that are equal to -999 and pass the result to count
  count() #count the number of remaining rows
##   n
## 1 2
Info: Piping is not friends with every function. Some functions will not accept input from a pipe (no matter how nice it is!). This will become clearer as we code throughout this course.
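
The pipe only changes how the code is written, not what it does. As a small sketch, the example below runs the same filter-then-count step both ways on a made-up data frame (toy_data, nested and piped are hypothetical names). It assumes the dplyr package is installed, which it will be if you have installed the tidyverse.

```r
library(dplyr)

# Made-up data with two -999 placeholders
toy_data <- data.frame(id = c("A", "B", "C", "D"),
                       hours = c(2, -999, 3, -999))

# Without a pipe: read from the inside out
nested <- count(filter(toy_data, hours == -999))

# With pipes: read from top to bottom
piped <- toy_data %>%
  filter(hours == -999) %>%
  count()

nested$n  # 2
piped$n   # 2
```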

Now let’s use piping to clean this data up by replacing the -999 values with NA, which R understands as missing.
We can do this using the mutate() and na_if() functions from the dplyr package (part of the tidyverse). The mutate() function creates or changes columns in a data frame, and na_if() replaces a given value with NA.

social_media_NA <- social_media %>%
  mutate(time_on_social = na_if(time_on_social, -999)) #mutate changes (or adds) columns.
                                                       #na_if replaces -999 with NA.
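
It is worth checking that a replacement like this did what we expected. Here is a minimal sketch on made-up data (toy_data and toy_clean are hypothetical names), assuming dplyr is installed. The same two checks could be run on a cleaned version of the course data.

```r
library(dplyr)

# Made-up data with two -999 placeholders
toy_data <- data.frame(hours = c(2, -999, 3, -999))

toy_clean <- toy_data %>%
  mutate(hours = na_if(hours, -999))

sum(is.na(toy_clean$hours))                 # 2, both placeholders are now NA
sum(toy_clean$hours == -999, na.rm = TRUE)  # 0, no -999 values remain
```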

Data visualization using ggplot2

Now let’s look at some data! We’re going to start by visualising the time_on_social variable. Visualising helps us understand more about the distribution of the data, which helps us understand what kinds of analysis we can perform.

To do this we will need to use the ggplot() function. This is the main function from the ggplot2 package (you should know what this is from reading the documentation). ggplot() provides the canvas of the graph you want to make.

To make the basic canvas ggplot() requires two things:

  1. The data that you want it to plot.

  2. The variables to go on the x and y axes.

Importantly, ggplot() only provides the canvas. It does not draw anything by itself. You have to add layers to the canvas created by ggplot() by using other functions that can create bars, points or lines!
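
The idea of a canvas plus layers can be sketched with a small made-up data frame (toy_data, canvas and with_boxplot are hypothetical names). This assumes ggplot2 is installed. Counting the layers stored in each plot object shows that ggplot() alone has none, and that adding a geom adds one.

```r
library(ggplot2)

# Made-up data
toy_data <- data.frame(hours = c(1.5, 2, 2.5, 3, 4))

# The canvas: data and axis mapping only
canvas <- ggplot(toy_data, aes(y = hours))
length(canvas$layers)        # 0, nothing is drawn yet

# Adding a geom puts a layer on the canvas
with_boxplot <- canvas + geom_boxplot()
length(with_boxplot$layers)  # 1, the boxplot layer
```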

Here we use geom_boxplot() which creates a boxplot for us.

social_media_NA %>%
  ggplot(aes(y = time_on_social)) + #aes() (short for "aesthetics") maps variables to the axes.
  scale_x_discrete() + #this tells ggplot that the x-axis is categorical.
  geom_boxplot() + #creates a boxplot
  labs(y = "Time on Social Media") #short for "labels"; used to label axes and titles.
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Warning: We receive a warning here because ggplot() recognises the ‘NA’ values and removes them before plotting. Be careful, as not all R functions are able to do this.
Question: What approximately is the median value? The lower quartile? The upper quartile? Is there another way that we could get this information in a more exact form? Discuss this with your deskmate and your tutor.

Activity 2 - Creating a histogram in ggplot()

ggplot() can be customised with many more functions than we have shown here to make truly beautiful looking plots. We will be learning how to do this throughout the next few weeks.

For now, let’s see if you can put some of the skills you have learned so far to good use. See if you can work out how to make a histogram of the data using the function geom_histogram().

Hint: You will only need to provide an x variable this time!
social_media_NA %>%
  ggplot(aes(x = time_on_social)) + #aes() (short for "aesthetics") maps variables to the axes.
  geom_histogram() + #creates a histogram (the y-axis shows counts by default)
  labs(x = "Time on social media", y = "Count") #short for "labels"; used to label axes and titles.
Question: What conclusions would you draw about the shape of the data, given your histogram? Please discuss with your deskmate and tutor.
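
geom_histogram() groups the values into bins. If you do not choose a bin size, it uses 30 bins and prints a message suggesting that you pick one yourself. Here is a minimal sketch of setting the bin width, using made-up data (toy_data and toy_histogram are hypothetical names) and assuming ggplot2 is installed. The width of 0.5 is just an example value, not a recommendation for the course data.

```r
library(ggplot2)

# Made-up hours of social media use
toy_data <- data.frame(hours = c(1.2, 1.8, 2.1, 2.4, 2.6, 3.0, 3.3, 4.1))

toy_histogram <- ggplot(toy_data, aes(x = hours)) +
  geom_histogram(binwidth = 0.5) +  # each bar covers half an hour
  labs(x = "Hours", y = "Count")

toy_histogram  # typing the name of the plot draws it
```

Try a few different bin widths: the overall shape of a histogram can look quite different depending on how wide the bins are.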

Well done! You have completed everything you need to for this week. If you have finished in record time, please ask your tutor what to do next. Otherwise, we will see you next lab!

Figure 7: Students’ reaction to this information!
